Overview

The Secure Data Commons (SDC) provides access to novel data sets within the Department of Transportation for researchers and partner agencies. The SDC provides a platform for analysis efforts which integrate these novel data sets. To date, this ‘Commons’ aspect of the Secure Data Commons has not been fully utilized; this research project demonstrates an approach to integrating multiple data sets in the SDC. This document summarizes the work of the SDC Cross-project team in assessing how two of the SDC data sets can be usefully analyzed together.

Two data sets are used in this work:

  • FMI Data: The Freight Mobility Initiative (FMI) data is provided via an agreement between the American Transportation Research Institute (ATRI) and the Bureau of Transportation Statistics (BTS) of the USDOT. This data set includes GPS locations, speed, and direction of travel for anonymized trucks.

  • Waze Data: The mobile phone navigation app company Waze has been providing data on crowdsourced traffic incident reports and additional system-generated data via an agreement with the DOT. These data include location and time of incidents, along with a number of characteristics of incidents, as well as location, extent, and speed of jams.

Data set characteristics:

| Characteristic | FMI | Waze | Notes |
|---|---|---|---|
| Spatial range | National | National | Waze data organized with state partitions |
| Available time range | Since October 2018 | Since March 2017 | Gaps exist for some time periods |
| Volume of data | 106 B GPS records | 1.6 B alert records; 3.4 B jam records; 11 B jam point sequence records | As of late April 2020 |
| Update frequency | Bi-weekly | API request every 2 minutes | Both have pipeline processing times, so not available to analysts immediately |

Main takeaways

  • The data can be usefully integrated together. The two data sets provide complementary information on volume and characteristics of crowdsourced traffic incidents, and volume and characteristics of commercial trucking activity. Together, these data could be applied to a number of use cases in roadway safety, roadway system performance, and travel monitoring.

  • Through this effort, we developed general-purpose query and data processing scripts in GitLab which can be used by other teams. The code collaboration tools in the SDC provide a means for future analysts to replicate and build on the analysis presented here. This effort leveraged some existing work by other teams (BTS) who had started exploring the FMI data, as well as code from the Safety Data Initiative Waze pilot project.

  • Successfully integrating these two high-volume, high-velocity data sets requires familiarity with the characteristics of the data, the data warehousing system, and best practices for geospatial analysis. Documentation exists in the SDC to provide much of this necessary information, but future users of these data would benefit from documentation which goes into more detail.

  • The spatial and temporal patterns of the FMI and Waze data are distinct.

    • There are locations and times where there is high truck activity but low Waze activity, likely distribution centers and commercial shipping routes.
    • The converse is true; Waze data volume is highest at commuting hours, in more densely populated areas.
    • However, there are useful areas of overlap between these two data sets.

Methods

Given the vast scale of these data sets, this assessment selected one region of the country, for one month, namely eastern Massachusetts for September 2019. This allows a proof-of-concept assessment of how these data can be joined and analyzed in concert. The process involved the following steps:

Accessing the data and requisite skills

The SDC support team provides a detailed user guide for data analysts, available publicly at https://its.dot.gov/data/secure/files/Secure_Data_Commons_Data_Analyst_User_Guide.pdf. This guide, along with instructions provided by the SDC support team, is sufficient for users to understand how to log on and access these data sets. Additional requisite skills include the following:

  • Working with data in the SDC requires at a minimum some familiarity with SQL (structured query language). SQL is a general-purpose query language familiar to users of any relational database.
  • Efficient use of the data is greatly facilitated by familiarity with a modern data science scripting language such as R or Python. These open-source languages have a wide variety of sophisticated data science tools, including machine learning, geospatial analysis, time-series modeling, data integration, and data visualization. The SDC provides support for both R and Python, on the default Windows machines as well as on optional Linux machines.
  • Code collaboration and version control are supported in SDC with a GitLab server. This allows analysts to follow modern data science best practices by developing code in collaboration with teammates, keeping working versions of code available while developing new features. The data analyst user guide provides instructions on how to enable GitLab code repositories, but analysts need to bring their own knowledge of how to effectively set up and manage the repositories for their needs.

Accessing the data

Querying the data warehouse

Both the Waze and FMI data can be queried using standard SQL syntax. The Waze data are warehoused in an Amazon Redshift cluster, while the FMI data are warehoused in a Hive/Hadoop cluster. In practice, the query syntax is similar for both: query the database for a date range and general spatial area, then perform more detailed spatial analysis on the results from the query. In the course of developing the code for this project, we encountered performance issues with the Hive/Hadoop cluster. We worked with the SDC developer team to test improvements to the database structure, which the SDC team has implemented.

The ‘two-step’ query process (first query a general area in SQL, then perform more detailed spatial data processing on the results) could in theory be simplified to one step. However, this would put much more computational load on the data warehouse infrastructure, whereas the two-step process places that load on a data analyst’s own EC2 compute instance. The first step is similar for both data sets: querying for a date range (2019-09-01 to 2019-09-30) and a latitude and longitude range that encompasses eastern Massachusetts (or any region of interest). The second step is also the same for both: convert to spatial data frames, apply the same geographic projection to the latitudes and longitudes, and clip to a shapefile for the state of Massachusetts using standard, open-source geospatial analysis libraries.
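The two-step pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the table name, the column names (`event_time`, `lat`, `lon`), and the bounding box values are hypothetical placeholders, and the `%s` placeholders follow a style used by some Python database drivers. The clipping step here substitutes a plain ray-casting point-in-polygon test for the geospatial libraries a real analysis would use, so that the sketch runs with the standard library alone.

```python
from datetime import date


def build_bbox_query(table, start, end, bbox):
    """Step 1: build a coarse date-range and bounding-box query.

    `table` and the column names are hypothetical; `table` must come from
    trusted code, since it is interpolated directly into the SQL text.
    `bbox` is (min_lon, min_lat, max_lon, max_lat); `end` is exclusive.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    sql = (
        f"SELECT * FROM {table} "
        "WHERE event_time >= %s AND event_time < %s "
        "AND lat BETWEEN %s AND %s "
        "AND lon BETWEEN %s AND %s"
    )
    params = (start.isoformat(), end.isoformat(),
              min_lat, max_lat, min_lon, max_lon)
    return sql, params


def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) falls inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def clip_points(points, ring):
    """Step 2: keep only the (lon, lat) points inside the boundary ring."""
    return [p for p in points if point_in_polygon(p[0], p[1], ring)]


# Illustrative bounding box only; not the extent used in the study.
sql, params = build_bbox_query(
    "waze_alerts", date(2019, 9, 1), date(2019, 10, 1),
    (-72.0, 41.2, -69.9, 42.9),
)
```

Keeping the SQL filter coarse and doing the precise clip on the analyst's instance mirrors the load-balancing rationale given above.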

Spatial joining

Joining two spatial datasets involves decisions on the part of the analysis team; the methods depend on the needs of the analysis. The two most meaningful options for this project were to join to road segments, or to join to a common grid structure. These two approaches are summarized below:

  • Road segment
    • Joining points to road segments results in one row per road segment, with attributes for number and characteristics of commercial motor vehicles, and number and characteristics of traffic alerts. Each time step would add one additional row for each road segment.
    • The Highway Performance Monitoring System (HPMS) provides a road network for the National Highway System roads. While convenient to use for spatial analyses, it is limited by not including most lower functional class roads.
    • Each state maintains its own detailed road network. It would be possible to access a road network for this study area (Massachusetts), but the methods of spatial joining would need to be adapted to apply to other states.
    • Any analysis needs to take into account the different lengths and characteristics of the road segments; joining points to road segments requires careful work to assign each point to the right lane and direction (applicable to higher functional class roads).
  • Grid or Polygon
    • Joining points to grid cells or polygons results in one row per grid cell or polygon, with attributes for number and characteristics of commercial motor vehicles, and number and characteristics of traffic alerts. Each time step would add an additional row for each grid cell or polygon.
    • Joining to a county, census block, or other political boundary is straightforward and would facilitate use of other characteristics of that political unit, including population or socio-economic data like the Longitudinal Employer-Household Dynamics (LEHD) data from the Census Bureau.
    • Joining to a grid cell is likewise straightforward. By providing an equal-area grid cell to join to, spatial analysis is greatly facilitated (as area does not need to be used as a co-variate).

For most safety applications, it is ideal to work at the road segment level, since that is where safety interventions can be applied. For our initial work, we elected to use a grid to facilitate rapid analysis for the purpose of comparing the two data sets in conjunction with each other. We joined all point data (FMI and Waze) to a grid layer of 1-square-mile hexagons. Hexagonal tessellations are appropriate because they better represent underlying linear features like roads, while minimizing the data artifacts associated with edge areas (ESRI reference). The area of eastern Massachusetts selected here encompasses 3,468 square miles. For the month of September 2019, data on 5,710,425 unique trucks by hour were included from the FMI data, as well as 460,165 unique Waze events by hour.
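As a rough illustration of joining points to a hexagonal grid, the sketch below assigns projected coordinates to 1-square-mile hexagons using axial coordinates and cube rounding. It assumes the points are already projected to planar coordinates measured in miles and uses a pointy-top hexagon orientation; the function names are hypothetical, and this is not necessarily how the project built or joined its grid layer.

```python
import math
from collections import Counter

HEX_AREA_SQ_MI = 1.0
# Regular hexagon area = (3 * sqrt(3) / 2) * side**2, solved for the side.
HEX_SIDE_MI = math.sqrt(2.0 * HEX_AREA_SQ_MI / (3.0 * math.sqrt(3.0)))


def hex_cell(x_mi, y_mi, side=HEX_SIDE_MI):
    """Return the axial (q, r) index of the hexagon containing a point."""
    # Fractional axial coordinates for a pointy-top hexagon layout.
    q = (math.sqrt(3.0) / 3.0 * x_mi - y_mi / 3.0) / side
    r = (2.0 / 3.0 * y_mi) / side
    # Cube rounding: round all three cube coordinates, then recompute the
    # one with the largest rounding error so that they still sum to zero.
    x, z = q, r
    y = -x - z
    rx, ry, rz = round(x), round(y), round(z)
    dx, dy, dz = abs(rx - x), abs(ry - y), abs(rz - z)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return int(rx), int(rz)


def hex_center(q, r, side=HEX_SIDE_MI):
    """Return the planar center (x, y), in miles, of hexagon (q, r)."""
    return side * math.sqrt(3.0) * (q + r / 2.0), side * 1.5 * r


def count_by_cell(points):
    """Count projected (x, y) points per hexagonal cell."""
    return Counter(hex_cell(x, y) for x, y in points)
```

Because every cell has the same area, the per-cell counts can be compared directly without normalizing by area.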

Code for this project was written in SQL, Python, and R, and is version-controlled in the FMI_Waze repository on the GitLab service within SDC.

Results

Mapping

The following maps show aggregated counts of unique truck IDs per hour and counts of Waze events per hour. The spatial unit is a 1-square-mile hexagonal grid cell. A road segment approach using the HPMS network or a more detailed state-provided network is possible in the future.

These maps further aggregate the count of each to daytime (7 am to 7 pm, Eastern) or nighttime, and to weekend or weekday time periods, for ease of comparison.
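The time-period aggregation can be sketched as below. The sketch assumes timestamps are already in Eastern local time, treats 7 am as inclusive and 7 pm as exclusive, and averages only over hours that have at least one record; those boundary and averaging choices are assumptions, not details stated in this report.

```python
from collections import defaultdict
from datetime import datetime


def classify_period(ts):
    """Label a local timestamp as (Weekday|Weekend, Daytime|Nighttime)."""
    day_type = "Weekend" if ts.weekday() >= 5 else "Weekday"
    time_of_day = "Daytime" if 7 <= ts.hour < 19 else "Nighttime"
    return day_type, time_of_day


def mean_hourly_count(hourly_counts):
    """Average hourly counts per (cell, day type, time of day).

    `hourly_counts` is an iterable of (cell_id, hour_start, count), where
    hour_start is a datetime marking the start of a one-hour bin.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of counts, hours]
    for cell, ts, count in hourly_counts:
        key = (cell,) + classify_period(ts)
        totals[key][0] += count
        totals[key][1] += 1
    return {key: total / hours for key, (total, hours) in totals.items()}
```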

Weekdays

Comparing daytime to nighttime, the FMI data show consistent patterns but less total volume at night. The mean count of unique truck IDs per hour ranges as high as 106, and high values are consistently observed along the Interstate 495 and Interstate 90 routes (Google Map of the area). While interstates appear prominently in the FMI data, trucking activity is present across the entire study area.

When clicking the Waze Alert Counts tab, a strikingly different pattern emerges. Note the scale differs between the FMI and Waze tabs. The dense cluster of Waze alert activity in the metro Boston area is maintained in both daytime and nighttime, but much less activity is observed in the nighttime.

FMI Truck Counts

Waze Alert Counts

Weekend

The patterns on the weekends are similar for both FMI and Waze. Note that the scale is the same as for the weekday plots within each data set (FMI and Waze), but remains different across data sets. The FMI data show distinct patches of high activity on weekends, which may represent distribution centers or other warehouses. Waze activity is reduced on weekends, and distinct patches of high activity likewise emerge.

FMI Truck Counts

Waze Alert Counts

FMI Truck Speeds

Note these maps compare just weekday and weekend time periods for mean speeds of unique truck IDs within a grid cell. The mean speed of a given truck ID was first calculated; these values are the means of those mean values by truck ID. Truck speeds are generally lower in the Boston metro area, with the interstate areas contributing some of the highest speed. Speeds generally retain the same characteristics on weekdays and weekends.
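The "mean of means" calculation described above can be illustrated with a small sketch. The record layout (cell ID, truck ID, speed) and the function name are hypothetical; the point is only that each truck contributes one value per cell regardless of how many GPS pings it produced.

```python
from collections import defaultdict
from statistics import mean


def cell_mean_speed(pings):
    """Mean of per-truck mean speeds within each grid cell.

    `pings` is an iterable of (cell_id, truck_id, speed) records.
    """
    by_truck = defaultdict(list)
    for cell, truck, speed in pings:
        by_truck[(cell, truck)].append(speed)
    by_cell = defaultdict(list)
    for (cell, _truck), speeds in by_truck.items():
        # First average within each truck ...
        by_cell[cell].append(mean(speeds))
    # ... then average those per-truck means within the cell.
    return {cell: mean(values) for cell, values in by_cell.items()}


# Made-up values: truck t1 has three pings and truck t2 has one.
example = [("A", "t1", 60), ("A", "t1", 40), ("A", "t1", 50), ("A", "t2", 10)]
result = cell_mean_speed(example)
```

In this made-up example the per-truck means are 50 and 10, so the cell value is 30, whereas a plain mean over all four pings would be 40. Averaging by truck first keeps a single frequently reporting truck from dominating the cell.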

Weekday

Weekend

Volume relationships

We additionally assessed the relationship between the two data sets in terms of the volume of data. A high correspondence in data volume indicates that when analyzed by equal-area grid cell and hour, generally high activity in one stream of data is matched by high activity in the other. Each point in the figures below represents one grid cell. Values are the sum of distinct truck IDs or unique Waze alerts for the specified time frame (weekend or weekday, daytime or nighttime) in that grid cell.

Since both axes represent highly skewed data (few, very large values), it is also useful to view these relationships on a logarithmic scale. This demonstrates close correspondence between the volume of FMI and volume of Waze alert data when accounting for the distribution of the values.
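One possible way to put a number on the log-scale correspondence is a Pearson correlation of log-transformed counts per grid cell, sketched below. This statistic is not reported in this document; the use of `log1p` (so that cells with zero counts remain defined) and the function names are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def log_scale_correlation(truck_counts, waze_counts):
    """Correlate per-cell counts after a log(1 + x) transform."""
    log_trucks = [math.log1p(v) for v in truck_counts]
    log_waze = [math.log1p(v) for v in waze_counts]
    return pearson(log_trucks, log_waze)
```

The transform reduces the leverage of the few very large cells, which would otherwise dominate a correlation computed on the raw counts.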

Waze jams and truck speeds

The following plot and table demonstrate that the two data sets can be combined in useful ways. When at least 3 jam reports are present in the Waze data, there is a dramatic drop in the speeds of trucks in the FMI data. The boxplots below group truck speeds into five categories, and the y-axis represents the mean count of distinct truck IDs per grid cell for that time period and number of jams. Note that ‘low jams’ represents most of the data, 92,675 grid cells x weekend/weekday time period, while ‘high jams’ is a less common condition, only 8,465 grid cells x weekend/weekday time period.

The same data are summarized below in tabular form for quick reference.

Summary of FMI and Waze data by weekend/weekday and high Waze jam counts
| Day type | Jam level | Number of grid cell x time combinations | Median FMI truck speeds | Sum FMI truck counts | Median Waze jam reports | Sum all Waze reports |
|---|---|---|---|---|---|---|
| Weekday | High Jams | 6,593 | 15 | 435,550 | 10 | 232,675 |
| Weekday | Low Jams | 55,529 | 28 | 4,417,368 | 0 | 171,390 |
| Weekend | High Jams | 1,872 | 31 | 26,228 | 6 | 25,568 |
| Weekend | Low Jams | 37,146 | 34 | 831,279 | 0 | 42,885 |
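A grouping of this kind could be produced along the following lines. The threshold of 3 jam reports comes from the text above; the row layout, the function name, and the example values are hypothetical and are not drawn from the project data.

```python
from collections import defaultdict
from statistics import median

HIGH_JAM_THRESHOLD = 3  # "at least 3 jam reports", per the text above


def summarize_by_jam_level(rows):
    """Summarize truck speeds by day type and Waze jam level.

    `rows` is an iterable of (day_type, jam_count, truck_speed), one row
    per grid cell x time period combination.
    """
    groups = defaultdict(list)
    for day_type, jam_count, speed in rows:
        high = jam_count >= HIGH_JAM_THRESHOLD
        level = "High Jams" if high else "Low Jams"
        groups[(day_type, level)].append(speed)
    return {
        key: {"n": len(speeds), "median_speed": median(speeds)}
        for key, speeds in groups.items()
    }


# Made-up rows for illustration only.
summary = summarize_by_jam_level([
    ("Weekday", 0, 30), ("Weekday", 2, 26),
    ("Weekday", 3, 14), ("Weekday", 10, 16),
    ("Weekend", 0, 34),
])
```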

Next Steps

Next steps will be considered in consultation with the project manager and stakeholders. We consider the following to represent potentially fruitful next steps:

  • Pandemic response
    • The FMI data represents an unprecedented opportunity for USDOT to monitor and understand commercial motor vehicle activity. The COVID-19 pandemic is a natural experiment on all aspects of society, but in particular is affecting the trucking industry in distinct ways, with some states seeing spikes at the outset of the pandemic, followed by dips in trucking activity of different depths (ATRI analysis).

    • Combining the Waze and FMI data would provide a way for the department to understand and monitor the distinctive behaviors of commuters versus commercial motor vehicles as our country recovers from the pandemic.

  • Travel time / Mobility / Energy
    • System-generated jam files from Waze include the spatial extent as well as severity of the jam. Looking at the influence of jams on freight travel from FMI could be fruitful.

    • Fuel efficiency and truck platooning are possible to investigate using the FMI data, since unique truck IDs are available with GPS pings at high frequency.

  • Safety
    • Crash predictability (with police crash report data). The Waze data has been successfully used to understand and predict crash frequency at broad spatial scales. Combining both Waze and FMI could support a crash prediction tool for commercial motor vehicles better than using either one of the data sets alone.
  • Assessment of data and validation
    • WYDOT Speed data to NPM-RDS. The Connected Vehicle Pilot in SDC has detailed vehicle speed data from Wyoming Department of Transportation. These data could be usefully validated against speed profile data for NHS road segments in the National Performance Management Research Data Set (NPM-RDS) from FHWA.

    • TMAS to FMI. The Traffic Monitoring Analysis System (TMAS) from FHWA provides ground-truthed traffic count and vehicle classification data at over 8,000 locations nationwide. Joining TMAS and FMI data would allow an understanding of the percent of all trucks which are represented in the FMI data by space and time.

    • TMAS to Waze (under SDI). The Safety Data Initiative is now undertaking an analysis of the Waze data in the context of the TMAS data as noted above for FMI. This will provide confidence intervals for how much of the traveling public is represented in the Waze data.